
[FLINK-39306][flink-autoscaler] Non-source vertices do not use per-second rate metrics, producing inaccurate scaling decisions#1078

Open
Dennis-Mircea wants to merge 1 commit into apache:main from Dennis-Mircea:FLINK-39306

Conversation

@Dennis-Mircea
Contributor

What is the purpose of the change

Fixes inaccurate scaling-metric computation for non-source vertices by collecting and consuming Flink's per-second rate metrics (numRecordsInPerSecond / numRecordsOutPerSecond) across the full autoscaler pipeline, and by replacing the endpoint-only getRate() with spike-resilient alternatives for gauge metrics such as LAG.

JIRA: https://issues.apache.org/jira/browse/FLINK-39306

Brief change log

  • FlinkMetric: Added NUM_RECORDS_IN_PER_SEC and NUM_RECORDS_OUT_PER_SEC enum entries to match Flink's numRecordsInPerSecond / numRecordsOutPerSecond task-level metrics.
  • ScalingMetric: Added NUM_RECORDS_IN_PER_SECOND and NUM_RECORDS_OUT_PER_SECOND scaling metric entries for storing per-second rates in the metrics history.
  • ScalingMetricCollector: Extended getFilteredVertexMetricNames to request NUM_RECORDS_IN_PER_SEC and NUM_RECORDS_OUT_PER_SEC for non-source vertices (previously only source-specific metrics were requested).
  • ScalingMetrics: Extended computeDataRateMetrics to store NUM_RECORDS_IN_PER_SECOND and NUM_RECORDS_OUT_PER_SECOND in the collected scaling metrics for non-source vertices when available.
  • ScalingMetricEvaluator:
    • Introduced getAverageWithRateFallback(perSecondMetric, accumulatedMetric, ...) - tries getAverage(perSecondMetric) first (direct per-second rate from Flink), falls back to getRate(accumulatedMetric) (endpoint-based delta from accumulated counters) when the per-second metric is unavailable.
    • Introduced getAverageRate(metric, ...) - computes the average of per-interval deltas across the full metrics window, replacing the spike-susceptible endpoint-only getRate() for gauge metrics like LAG.
    • Replaced all getRate(NUM_RECORDS_IN, ...) / getRate(NUM_RECORDS_OUT, ...) calls in isProcessingBacklog, evaluateMetrics, computeEdgeOutputRatio, and computeEdgeDataRate with getAverageWithRateFallback(NUM_RECORDS_IN_PER_SECOND, NUM_RECORDS_IN, ...) / getAverageWithRateFallback(NUM_RECORDS_OUT_PER_SECOND, NUM_RECORDS_OUT, ...).
    • Replaced getRate(LAG, ...) in computeTargetDataRate with getAverageRate(LAG, ...) to avoid diluted lag rate estimation when the metrics window is large.
  • JobAutoScalerImplTest: Fixed flaky testMetricReporting by injecting a deterministic Clock via autoscaler.setClock() to guarantee distinct timestamps across metric collections, avoiding the timestamp collision that caused the metrics history to stay at size 1.
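The difference between the replaced endpoint-only rate and the new windowed average can be sketched as follows. This is a minimal illustration only, not the actual ScalingMetricEvaluator code: the TreeMap-based history, the class name RateSketch, and both method names are assumptions for this example.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.TreeMap;

public class RateSketch {

    // Endpoint-only rate: (last - first) / window length. Only the two
    // endpoint observations matter, so a spike in either one distorts the
    // whole estimate, and intermediate structure is ignored entirely.
    static double endpointRate(TreeMap<Instant, Double> history) {
        Map.Entry<Instant, Double> first = history.firstEntry();
        Map.Entry<Instant, Double> last = history.lastEntry();
        double seconds =
                Duration.between(first.getKey(), last.getKey()).toMillis() / 1000.0;
        return (last.getValue() - first.getValue()) / seconds;
    }

    // Average of per-interval per-second deltas across the full window.
    // Every observation contributes, and a single spiked sample only
    // affects its two adjacent intervals.
    static double averageRate(TreeMap<Instant, Double> history) {
        double sum = 0;
        int intervals = 0;
        Map.Entry<Instant, Double> prev = null;
        for (Map.Entry<Instant, Double> e : history.entrySet()) {
            if (prev != null) {
                double seconds =
                        Duration.between(prev.getKey(), e.getKey()).toMillis() / 1000.0;
                sum += (e.getValue() - prev.getValue()) / seconds;
                intervals++;
            }
            prev = e;
        }
        return sum / intervals;
    }
}
```

With unevenly spaced observations the two estimators diverge: for samples 0 → 100 → 200 taken at t = 0 s, 10 s, 40 s, the endpoint rate is 200/40 = 5.0/s while the interval average is (10.0 + 100.0/30)/2 ≈ 6.67/s.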
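The deterministic-clock idea behind the flaky-test fix can be illustrated as below. This is a hypothetical sketch: the test itself injects a Clock via autoscaler.setClock(), and the SteppingClock class here is invented for illustration.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;

// A Clock that advances by a fixed step on every read, so two consecutive
// metric collections can never observe the same timestamp (the collision
// that caused the metrics history to stay at size 1).
public class SteppingClock extends Clock {
    private Instant current;
    private final Duration step;

    public SteppingClock(Instant start, Duration step) {
        this.current = start;
        this.step = step;
    }

    @Override
    public ZoneId getZone() {
        return ZoneOffset.UTC;
    }

    @Override
    public Clock withZone(ZoneId zone) {
        return this;
    }

    @Override
    public Instant instant() {
        // Return the current instant, then step forward for the next read.
        Instant result = current;
        current = current.plus(step);
        return result;
    }
}
```

Because every call to instant() returns a strictly later timestamp, each metric collection lands in a distinct history entry regardless of how fast the test loop runs.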

Verifying this change

This change can be verified by the existing unit tests once the flaky test is fixed. Additional unit tests will be added to validate:

  • getAverageWithRateFallback returns the per-second average when available and falls back to getRate when it is not.
  • getAverageRate computes the correct average of per-interval deltas and avoids the dilution that endpoint-only rates suffer over a large metrics window.
  • Non-source vertices correctly collect, store, and evaluate NUM_RECORDS_IN_PER_SECOND / NUM_RECORDS_OUT_PER_SECOND.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., any changes to the CustomResourceDescriptors: no
  • Core observer or reconciler logic that is regularly executed: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

